{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 数据增强的作用\n",
    "- 增加数据量\n",
    "- 增加鲁棒性\n",
    "- 对于不平衡的数据\n",
    "- 目标分解\n",
    "    - 将困难度不同的任务拆分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 词汇替换"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 基于同义词的替换，如 WordNet\n",
    "- 基于词向量的替换，利用预训练的词向量，如 Word2Vec，GloVe，FastText 等训练得到的词向量，使用空间上最相邻的单词替换句子中的单词\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T11:53:55.542181Z",
     "start_time": "2020-06-09T11:53:55.528467Z"
    }
   },
   "outputs": [],
   "source": [
    "import gensim.downloader as api\n",
    "model = api.load(\"glove-twitter-25\")\n",
    "model.most_similar(\"awesome\", topn=5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Masked Language Model：使用 BERT，ROBERT，ALBERT 等模型，屏蔽文本的某些部分，然后预测被遮蔽掉的单词，作为替换词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-09T11:59:50.661848Z",
     "start_time": "2020-06-09T11:59:27.307309Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6c07cf9871b3478c8aaf812f480aab2f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Downloading', max=230.0, style=ProgressStyle(description_…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'sequence': '<s> This is pretty cool</s>',\n",
       "  'score': 0.5154130458831787,\n",
       "  'token': 1256},\n",
       " {'sequence': '<s> This is really cool</s>',\n",
       "  'score': 0.11662467569112778,\n",
       "  'token': 269},\n",
       " {'sequence': '<s> This is super cool</s>',\n",
       "  'score': 0.07387510687112808,\n",
       "  'token': 2422},\n",
       " {'sequence': '<s> This is kinda cool</s>',\n",
       "  'score': 0.04272918030619621,\n",
       "  'token': 24282},\n",
       " {'sequence': '<s> This is very cool</s>',\n",
       "  'score': 0.034715890884399414,\n",
       "  'token': 182}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from transformers import pipeline\n",
    "nlp = pipeline(\"fill-mask\")\n",
    "nlp(\"This is <mask> cool\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 文本转换\n",
    "- 使用正则表达式的模式匹配的转换，如将动词形式缩写转换成完整的形式"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 反向翻译"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 把一些句子(如英语)翻译成另一种语言，如法语\n",
    "\n",
    "- 将法语句子翻译回英语句子。\n",
    "\n",
    "- 检查新句子是否与原来的句子不同。如果是，那么我们使用这个新句子作为原始文本的数据增强。  \n",
    "\n",
    "\n",
    "对于反向翻译的实现，可以使用TextBlob。或者，你也可以使用Google Sheets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 随机噪声注入\n",
    "- 在文本中加入噪声，如在一些随机单词上添加拼写错误\n",
    "- Unigram噪声，从单字符频率分布中采样的单词进行替换\n",
    "- 句子打乱"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# UDA(Unsupervised Data Augmentation)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "原始链接：https://github.com/google-research/uda\n",
    "\n",
    "半监督学习，一半数据有标注一半数据未标注，\n",
    "- 标注数据训练模型\n",
    "- 未标注数据 A，并将其数据扩充得到数据集 B，\n",
    "- 利用标注数据得到的模型，分别对 A 和 B 进行预测，以两者的差别作为损失；理论上 A 和 B 的预测结果应该相同"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
